METHOD AND APPARATUS FOR COMPRESSING TEXTUAL DOCUMENTS

Field of the Invention 

The present invention relates to the compression of 
information, and more particularly, to the compression of textual 
documents encoded using tag-based markup languages, such as the 
Extensible Markup Language (XML) or the Standard Generalized 
Markup Language (SGML).

Background of the Invention 

The Extensible Markup Language (XML) is a standard for
encoding textual information that has been recommended by the
World Wide Web Consortium (W3C). Likewise, the Standard
Generalized Markup Language (SGML) is an international standard
(ISO 8879) meta-language that predates XML and is an ancestor to
XML. SGML is a language for describing a document structure.
XML is a simplification of SGML that is easier to use. For a
discussion of the XML and SGML standards, see, for example,
Extensible Markup Language (XML) 1.0 W3C Recommendation,
http://www.w3.org/TR/1998/REC-xml-19980210; and
http://www.w3.org/markup/SGML/overview.html, respectively, each
incorporated by reference herein.

The illustrative XML standard allows XML-enabled
applications to inter-operate with other compliant systems for
the exchange of encoded information. XML documents store textual 
data in a hierarchical tree structure. Each XML document has one 
root node, often referred to as the root element, with the other 
nodes in the hierarchical tree being arranged as descendants of 
the root node. Each XML document contains two types of elements, 
namely, data elements and the corresponding tag elements that 
impose the hierarchical structure on the data elements. 

Since XML documents contain only textual information, 
the documents can be quite large in size. In order to reduce the 
size of XML documents for transmission and storage, standard
compression algorithms suitable for textual information have been 
applied to entire XML documents. While the application of such 
standard compression techniques to entire XML documents has been 
an effective technique for reducing the overall size of such XML 
documents, this technique suffers from a number of limitations, 
which if overcome, could greatly expand the efficiency and 
usefulness of the compressed XML documents. Specifically, the 
compressed XML documents generated by such conventional XML 
compression techniques must be decompressed to be useful. A need 
therefore exists for a method and apparatus that compresses XML 
documents in a manner that allows the document to be processed in 
a compressed form. 
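
The limitation described above can be illustrated with a minimal
Python sketch. It assumes zlib as a stand-in for a standard
compression algorithm and uses a hypothetical sample document;
neither is specified in this description. Once the entire document
is compressed, tags included, a standard XML parser can no longer
read it until it has been decompressed.

```python
import zlib
import xml.etree.ElementTree as ET

# Hypothetical sample document (an assumption for illustration only).
SAMPLE = "<ROOT><A>HELLO</A><B>XXX</B></ROOT>"


def parses_as_xml(payload: bytes) -> bool:
    """Return True if an XML parser can read the payload as-is."""
    try:
        ET.fromstring(payload)
        return True
    except ET.ParseError:
        return False


# Conventional approach: compress the entire document, tags included.
# zlib is an assumed stand-in for "a standard compression algorithm".
whole_document_compressed = zlib.compress(SAMPLE.encode("utf-8"))

plain_ok = parses_as_xml(SAMPLE.encode("utf-8"))
compressed_ok = parses_as_xml(whole_document_compressed)
print(plain_ok, compressed_ok)
```

The tag structure only becomes visible again after the whole payload
has been passed through `zlib.decompress`.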

Summary of the Invention 

Generally, a method and apparatus are disclosed for 
compressing textual documents encoded using a tag-based markup 
language, such as XML or SGML documents, in a manner that allows 
a compressed XML document to be processed without decompression. 
The present invention compresses a textual document using a 
standard compression algorithm that is applied only to the data 
elements of the document. The tag elements of the document that 
impose the hierarchical structure on the data elements are not 
compressed. Thus, the present invention allows the hierarchical 
relationship of the data elements to be ascertained from the 
compressed document. Once the hierarchical relationship of the 
data elements is obtained from the compressed document, a user 
can selectively decompress desired portions of the document, 
without decompressing the entire document. 

In one exemplary embodiment, an identification of the 
employed compression technique is inserted into a root node tag 
element of the document. In another exemplary embodiment, an 
additional tag element pair is inserted into the document and an 
indication of the employed compression technique is inserted into
the additional tag element pair. The present invention allows a 
decoder to utilize the uncompressed tag elements in the otherwise 
compressed document to ascertain the hierarchical structure of 
the compressed data and present the user with a corresponding 
hierarchical expression of the document. 

A more complete understanding of the present invention,
as well as further features and advantages of the present
invention, will be obtained by reference to the following
detailed description and drawings.

Brief Description of the Drawings 

FIG. 1 illustrates a representative network environment 
where the present invention may operate; 

FIG. 2A illustrates a conventional hierarchical XML 
document tree in an uncompressed format; 

FIG. 2B illustrates a portion of the corresponding 
conventional pseudo-code necessary to construct the hierarchical 
XML tree of FIG. 2A;

FIG. 2C illustrates the pseudo-code of FIG. 2B as 
compressed in accordance with one embodiment of the present 
invention; 

FIG. 2D illustrates the pseudo-code of FIG. 2B as 
compressed in accordance with another embodiment of the present 
invention; 

FIG. 2E illustrates the hierarchical XML document tree 
in a compressed format according to the present invention; 

FIG. 3 is a block diagram showing the architecture of 
an illustrative XML transmitter in accordance with the present 
invention; and 

FIG. 4 is a flow chart describing an exemplary XML 
compression process 400 executed by the XML transmitter of FIG. 
3.

Detailed Description

FIG. 1 illustrates a network environment 100 where the 
present invention may operate. As shown in FIG. 1, an XML 
transmitter 300, discussed below in conjunction with FIG. 3, 
transmits a compressed XML document to an XML receiver 110. In a 
further application of the present invention, the compressed XML 
document may be sent over the network 100 to a server (not shown) 
for storage, or stored locally by the XML transmitter 300. 

FIG. 2A illustrates an XML document tree 200, and FIG. 
2B illustrates a portion of the corresponding pseudo-code 250 
necessary to construct the XML tree 200 of FIG. 2A. As shown in 
FIG. 2A, the XML document tree 200 includes a root node 205 and a 
number of sub-nodes 210, 220, 230, 240 and 245. As shown in FIG. 
2B, an XML document, such as the document 200, contains two types 
of elements, namely, data elements and the corresponding tag 
elements that impose the hierarchical structure on the data 
elements. It is noted that in the illustrative notation used in
FIG. 2B, each tag element is enclosed within angle brackets "<>"
to distinguish the tag elements from the data elements.

As shown in FIG. 2B, one feature of the XML language is 
that tag elements are utilized in matched pairs, with an opening 
and closing tag element corresponding to each node. It is noted 
that additional tag element pairs that do not directly correspond 
to a given node may also be included in an XML document, in a 
known manner.

In accordance with the present invention, the XML 
transmitter 300 compresses the XML document 200 using a standard 
compression algorithm that is applied only to the data elements 
of the document. Thus, the tag elements are not compressed. 
Among other benefits, the compression technique of the present 
invention allows the document to be validated by standard XML 
parsers without decompressing the document. In addition, the 
present invention allows a user to work with the compressed
document and ascertain the hierarchical relationship of the 
compressed data, without actually decompressing the data. Thus, 
the user can thereafter selectively decompress only desired 
portions of the document. 
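
A minimal Python sketch of this data-only compression follows. It is
not the implementation of the XML transmitter 300; it assumes zlib as
the standard compression algorithm and base64 as a way to keep the
compressed bytes legal as XML character data, neither of which is
specified in this description. The function name and the sample
document are likewise hypothetical.

```python
import base64
import zlib
import xml.etree.ElementTree as ET

# Sample document modeled on the tree of FIG. 2B (whitespace removed).
SAMPLE = "<ROOT><A>HELLO</A><B>XXX</B><C><D>YYY</D><E>ZZZ</E></C></ROOT>"


def compress_data_elements(xml_text: str) -> str:
    """Compress the text of each node and leave every tag untouched.

    Assumptions: zlib + base64 encoding; text that follows a child
    element (the "tail" in ElementTree terms) is not handled.
    """
    root = ET.fromstring(xml_text)
    for node in root.iter():
        if node.text and node.text.strip():
            packed = zlib.compress(node.text.encode("utf-8"))
            node.text = base64.b64encode(packed).decode("ascii")
    return ET.tostring(root, encoding="unicode")


compressed = compress_data_elements(SAMPLE)

# A standard parser still reads the result without any decompression.
tags = [node.tag for node in ET.fromstring(compressed).iter()]
print(tags)
```

For data elements as short as those in this sample, the encoded text
is longer than the original; any size benefit would only be expected
for larger data elements.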

FIGS. 2C and 2D illustrate how the pseudo-code of FIG. 
2B is compressed in accordance with two exemplary embodiments of 
the present invention. As shown in FIGS. 2C and 2D, a standard 
compression algorithm is applied to only the data elements of the 
XML document 200 and the tag elements are not compressed. In a 
first exemplary embodiment, shown in FIG. 2C, an identification 
of the employed compression technique 265 is inserted into the 
root node tag element. In a second exemplary embodiment, shown 
in FIG. 2D, an additional tag element pair 275, 276 indicating
the employed compression technique is inserted into the
pseudo-code 270. In both exemplary embodiments of the present
invention, the XML provisions regarding Document Type Definitions 
(DTDs) are modified to support the indication of the employed 
compression algorithm. 
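
The two ways of recording the compression technique can be sketched
in Python as follows. The attribute name COMPRESSION, the wrapper tag
COMPRESSED and its TYPE attribute are taken from FIGS. 2C and 2D; the
function names are hypothetical. Well-formed XML requires attribute
values to be quoted, so the output differs slightly from the
illustrative notation of the figures.

```python
import xml.etree.ElementTree as ET


def tag_root_with_algorithm(root: ET.Element, algorithm: str) -> ET.Element:
    """First embodiment (FIG. 2C): identifier placed in the root node tag."""
    root.set("COMPRESSION", algorithm)
    return root


def wrap_with_algorithm(root: ET.Element, algorithm: str) -> ET.Element:
    """Second embodiment (FIG. 2D): additional tag element pair around the document."""
    wrapper = ET.Element("COMPRESSED", {"TYPE": algorithm})
    wrapper.append(root)
    return wrapper


first = tag_root_with_algorithm(ET.fromstring("<ROOT><A>x</A></ROOT>"), "ZIP")
second = wrap_with_algorithm(ET.fromstring("<ROOT><A>x</A></ROOT>"), "ZIP")

print(ET.tostring(first, encoding="unicode"))
print(ET.tostring(second, encoding="unicode"))
```

The first form changes the attributes of an existing element, while
the second changes which element is the document root; either way a
DTD for the original document would need to permit the addition.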

FIG. 2E illustrates the hierarchical expression of the 
XML document 200' in a compressed format according to the present 
invention. An XML decoder can utilize the uncompressed tag 
elements of the compressed XML document 200' to ascertain the
hierarchical structure of the compressed data and present the 
user with the hierarchical expression shown in FIG. 2E. 
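
A decoder of this kind can be sketched in Python as shown below. The
sketch assumes that each data element was compressed with zlib and
base64-encoded, which is an illustrative choice and not part of this
description; the helper names pack, outline and decompress_node are
hypothetical.

```python
import base64
import zlib
import xml.etree.ElementTree as ET


def pack(text: str) -> str:
    """Build one compressed data element (assumed zlib + base64 encoding)."""
    return base64.b64encode(zlib.compress(text.encode("utf-8"))).decode("ascii")


# Compressed document with uncompressed tags, modeled on FIG. 2C.
COMPRESSED = (
    '<ROOT COMPRESSION="ZLIB">'
    f"<A>{pack('HELLO')}</A>"
    f"<B>{pack('XXX')}</B>"
    f"<C><D>{pack('YYY')}</D><E>{pack('ZZZ')}</E></C>"
    "</ROOT>"
)


def outline(node: ET.Element, depth: int = 0) -> list:
    """List the hierarchy from the tags alone, without decompressing anything."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


def decompress_node(root: ET.Element, path: str) -> str:
    """Selectively decompress the data element of a single node."""
    node = root.find(path)
    return zlib.decompress(base64.b64decode(node.text)).decode("utf-8")


root = ET.fromstring(COMPRESSED)
print("\n".join(outline(root)))
print(decompress_node(root, "C/D"))
```

Only the node addressed by the path is decompressed; every other
data element stays in its compressed form.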

FIG. 3 is a block diagram showing the architecture of
an illustrative XML transmitter 300 in accordance with the 
present invention. The XML transmitter 300 may be embodied as a 
general purpose computing system, such as the general purpose 
computing system shown in FIG. 3. As shown in FIG. 3, the XML 
transmitter 300 preferably includes a processor 310 and related 
memory, such as a data storage device 320, which may be 
distributed or local. The processor 310 may be embodied as a 
single processor, or a number of local or distributed processors
operating in parallel. The data storage device 320 and/or a read 
only memory (ROM) (not shown) are operable to store one or more 
instructions, which the processor 310 is operable to retrieve, 
interpret and execute.

The data storage device 320 includes a text source 350 
that may be retrieved from memory or generated in real-time. 
Thus, the text source 350 may be a pre-recorded textual file, 
such as a database or another document, or a document generated 
in real-time, for example, by a user entering textual information 
from a keyboard (not shown) or by a speech recognition system
(not shown). The data storage device 320 also includes one or
more compression algorithm(s) 360 that are suitable for
compressing textual information. For example, the compression 
algorithm(s) 360 may be embodied as the WinZip™ compression 
utility application, commercially available from Nico Mak 
Computing, Inc., of Mansfield, CT, as modified herein to carry 
out the features and functions of the present invention. Thus, 
the XML transmitter 300 can process the text source 350 using an 
identified compression algorithm 360 to generate the compressed 
document, in accordance with the present invention. 

The data storage device 320 also includes an XML 
compression process 400, discussed hereinafter in conjunction
with FIG. 4, that compresses each data field in an XML document
200, and leaves each tag uncompressed. 

FIG. 4 is a flow chart describing an exemplary XML 
compression process 400 executed by the XML transmitter 300 of 
FIG. 3. As previously indicated, the XML compression process 400 
compresses each data field in an XML document 200, and leaves 
each tag uncompressed. As shown in FIG. 4, the XML compression 
process 400 initially retrieves the XML document 200 to be 
compressed, for example, from the text source 350 (FIG. 3) during 
step 410.

Thereafter, the XML compression process 400 applies a
standard compression algorithm 360 (FIG. 3) to only the data
elements of the XML document 200 during step 420. The XML
compression process 400 then inserts an identifier 265 of the
employed compression algorithm 360 into the root node tag, in
accordance with the embodiment shown in FIG. 2C, or inserts an
additional tag element pair 275, 276 indicating the employed
compression algorithm 360 into the pseudo-code 270, in accordance
with the embodiment shown in FIG. 2D, during step 430. In this
manner, the XML decoder can utilize the same compression
algorithm 360 to decompress the compressed XML document 200.

Finally, the XML compression process 400 transmits the
compressed XML document 200 to a receiver 110 over the network
100, or stores the compressed XML document 200 (remote or local
storage), during step 440. Program control then terminates during
step 450.
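
The flow of FIG. 4, together with the decoder's use of the inserted
identifier, can be sketched in Python as follows. This is a sketch
under stated assumptions and not the XML compression process 400
itself: the identifiers "ZLIB" and "BZ2", the CODECS registry, the
base64 encoding of compressed bytes and the function names are all
hypothetical. The root-attribute embodiment of FIG. 2C is used.

```python
import base64
import bz2
import zlib
import xml.etree.ElementTree as ET

# Hypothetical registry mapping an identifier to (compress, decompress).
CODECS = {
    "ZLIB": (zlib.compress, zlib.decompress),
    "BZ2": (bz2.compress, bz2.decompress),
}


def compress_document(xml_text: str, algorithm: str) -> bytes:
    compress, _ = CODECS[algorithm]
    root = ET.fromstring(xml_text)                 # step 410: retrieve document
    for node in root.iter():                       # step 420: data elements only
        if node.text and node.text.strip():
            packed = compress(node.text.encode("utf-8"))
            node.text = base64.b64encode(packed).decode("ascii")
    root.set("COMPRESSION", algorithm)             # step 430: insert identifier
    return ET.tostring(root, encoding="utf-8")     # step 440: bytes to send or store


def decompress_document(payload: bytes) -> str:
    root = ET.fromstring(payload)
    # The identifier in the root tag selects the matching algorithm.
    _, decompress = CODECS[root.attrib.pop("COMPRESSION")]
    for node in root.iter():
        if node.text and node.text.strip():
            node.text = decompress(base64.b64decode(node.text)).decode("utf-8")
    return ET.tostring(root, encoding="unicode")


DOC = "<ROOT><A>HELLO</A><B>XXX</B><C><D>YYY</D><E>ZZZ</E></C></ROOT>"
payload = compress_document(DOC, "BZ2")
restored = decompress_document(payload)
print(restored == DOC)
```

Because the identifier travels inside the document, the receiving
side needs no out-of-band agreement about which algorithm was used;
text following a child element (mixed content) is not handled by
this sketch.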

It is to be understood that the embodiments and 
variations shown and described herein are merely illustrative of 
the principles of this invention and that various modifications 
may be implemented by those skilled in the art without departing 
from the scope and spirit of the invention.

Claims

What is claimed is:

1. A method of compressing a textual document comprised of 
data elements and tag elements that impose a hierarchical 
structure on said data elements, said method comprising the steps 
of: 

identifying said data elements in said document; and 
compressing only said data elements in said document 
using a compression algorithm. 

2. The method of claim 1, further comprising the step of 
inserting an identifier of said compression algorithm in said 
document.

3. The method of claim 2, wherein said step of inserting 
an identifier of said compression algorithm in said document 
inserts said identifier in a root node tag element. 

4. The method of claim 2, wherein said step of inserting
an identifier of said compression algorithm in said document
further comprises the steps of inserting a new tag element in
said document and inserting said identifier in said new tag
element.

5. The method of claim 1, further comprising the step of 
transmitting said compressed document. 

6. The method of claim 1, further comprising the step of 
storing said compressed document.

7. The method of claim 1, wherein said document is
generated in real-time by a user operating a textual input
device.

8. The method of claim 1, wherein said document is 
generated in real-time by a speech recognition system. 

9. The method of claim 1, wherein said document is an XML 
document.

10. The method of claim 1, wherein said document is an SGML
document.

11. A method of compressing a document, said document
comprised of data elements and tag elements that impose a 
hierarchical structure on said data elements, said method 
comprising the steps of: 

compressing only said data elements in said document 
using a compression algorithm; and 

inserting an identifier of said compression algorithm 
in said document. 

12. The method of claim 11, wherein said step of inserting 
an identifier of said compression algorithm in said document 
inserts said identifier in a root node tag element. 

13. The method of claim 11, wherein said step of inserting 
an identifier of said compression algorithm in said document 
further comprises the steps of inserting a new tag element in 
said document and inserting said identifier in said new tag 
element.

14. The method of claim 11, further comprising the step of
transmitting said compressed document. 

15. The method of claim 11, further comprising the step of 
storing said compressed document. 

16. The method of claim 11, wherein said document is 
generated in real-time by a user operating a textual input 
device.

17. The method of claim 11, wherein said document is 
generated in real-time by a speech recognition system. 

18. A system for compressing a document, said document 
comprised of data elements and tag elements that impose a 
hierarchical structure on said data elements, said system 
comprising:

a memory for storing content and computer readable
code; and

a processor operatively coupled to said memory, said
processor configured to:

identify said data elements in said document; and
compress only said data elements in said document using
a compression algorithm.

19. A system for compressing a document, said document 
comprised of data elements and tag elements that impose a 
hierarchical structure on said data elements, said system 
comprising:

a memory for storing content and computer readable
code; and

a processor operatively coupled to said memory, said
processor configured to:

compress only said data elements in said document using
a compression algorithm; and

insert an identifier of said compression algorithm in
said document.

20. An article of manufacture for compressing a document, 
said document comprised of data elements and tag elements that 
impose a hierarchical structure on said data elements, 
comprising:

a computer readable medium having computer readable
code means embodied thereon, said computer readable program code
means comprising:

a step to identify said data elements in said document;
and

a step to compress only said data elements in said
document using a compression algorithm.

21. An article of manufacture for compressing a document, 
said document comprised of data elements and tag elements that 
impose a hierarchical structure on said data elements, 
comprising: 

a computer readable medium having computer readable 
code means embodied thereon, said computer readable program code 
means comprising:

a step to compress only said data elements in said 
document using a compression algorithm; and 

a step to insert an identifier of said compression 
algorithm in said document.

ABSTRACT

A method and apparatus are disclosed for compressing 
textual documents encoded using a tag-based markup language,
such as XML or SGML documents, in a manner that allows a 
compressed document to be processed without decompression. A 
document is compressed using a standard compression algorithm 
that is applied only to the data elements of the document. The 
tag elements of the XML document that impose the hierarchical 
structure on the data elements are not compressed. The 
hierarchical relationship of the data elements can be ascertained 
from the compressed document. A user can thereafter selectively 
decompress desired portions of the document, without 
decompressing the entire document. An identification of the 
employed compression technique can be inserted into a root node 
tag element of the XML document or into an additional tag element 
pair that is inserted into the XML document. An XML decoder can 
utilize the uncompressed tag elements to ascertain the 
hierarchical structure of the compressed data and present the 
user with a corresponding hierarchical expression of the 
document.






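[FIG. 2A (PRIOR ART): XML document tree 200, in which node A 210
holds the data element HELLO and the remaining nodes hold the data
elements XXX, YYY and ZZZ.]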



FIG. 2B (PRIOR ART): pseudo-code 250

    <ROOT>
        <A> HELLO </A>
        <B> XXX </B>
        <C> <D> YYY </D> <E> ZZZ </E> </C>
    </ROOT>



FIG. 2C: pseudo-code 260, with the compression identifier 265 in
the root node tag

    <ROOT COMPRESSION=ZIP>
        <A> ^compressed data^ </A>
        <B> ^compressed data^ </B>
        <C> <D> ^compressed data^ </D> <E> ^compressed data^ </E> </C>
    </ROOT>



FIG. 2D: pseudo-code 270, with the additional tag element pair
275, 276

    <COMPRESSED TYPE=ZIP>        (275)
    <ROOT>
        <A> ^compressed data^ </A>
        <B> ^compressed data^ </B>
        <C> <D> ^compressed data^ </D> <E> ^compressed data^ </E> </C>
    </ROOT>
    </COMPRESSED>                (276)



[FIG. 2E: compressed XML document tree 200', in which the data
element of each node, such as node 210 and node E 245, is replaced
by ^compressed data^.]



[FIG. 3: XML transmitter 300, comprising a processor 310 with a
connection to the computer network 100, and a data storage device
320 containing the text source 350, the compression algorithm(s)
360 and the XML compression process 400.]



FIG. 4: XML COMPRESSION PROCESS 400

410: RETRIEVE XML DOCUMENT 200, E.G., FROM TEXT SOURCE 350

420: APPLY COMPRESSION ALGORITHM ONLY TO DATA ELEMENTS OF XML
     DOCUMENT 200

430: INSERT IDENTIFIER OF COMPRESSION ALGORITHM INTO ROOT NODE
     TAG - FIG. 2C (OR INSERT ADDITIONAL TAG ELEMENT PAIR
     INDICATING EMPLOYED COMPRESSION TECHNIQUE INTO PSEUDO-CODE -
     FIG. 2D)

440: TRANSMIT COMPRESSED XML DOCUMENT TO RECEIVER OR STORE
     COMPRESSED DOCUMENT



IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 
In re Application of: RAYMOND KRASINSKI
Atty. Docket: US000284

Filed: CONCURRENTLY 

Title: METHOD AND APPARATUS FOR COMPRESSING TEXTUAL DOCUMENTS 
Commissioner for Patents, Washington, D.C. 20231 

APPOINTMENT OF ASSOCIATES 

Sir: 

The undersigned Attorney of Record hereby revokes all
prior appointments (if any) of Associate Attorney(s) or Agent(s) in
the above-captioned case and appoints:

GREGORY L. THORNE (Registration No. 39,398)

c/o U.S. PHILIPS CORPORATION, Intellectual Property Department, 580
White Plains Road, Tarrytown, New York 10591, his Associate
Attorney(s)/Agent(s) with all the usual powers to prosecute the
above-identified application and any division or continuation
thereof, to make alterations and amendments therein, and to
transact all business in the Patent and Trademark Office connected
therewith.

ALL CORRESPONDENCE CONCERNING THIS APPLICATION AND THE 
LETTERS PATENT WHEN GRANTED SHOULD BE ADDRESSED TO THE UNDERSIGNED 
ATTORNEY OF RECORD. 



Dated at Tarrytown, New York 
on October 23, 2000. 



Respectfully,

Jack E. Haken, Reg. 26,902
Attorney of Record





DECLARATION and POWER OF ATTORNEY 

As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next to my name. 

I believe I am the original, first and sole inventor (if only one name is listed below) or an original, first and joint 
inventor (if plural names are listed below) of the subject matter which is claimed and for which a patent is sought on the 
invention entitled Method and Apparatus for Compressing Textual Documents 
the specification of which (check one) 
X is attached hereto. 

____ was filed on ________ as Application Serial No. ________ and
was amended on ________ (if applicable).

I hereby state that I have reviewed and understand the contents of the above-identified specification, including the 
claims, as amended by the amendment(s) referred to above. 

I acknowledge the duty to disclose information which is material to the patentability of this application in accordance 
with Title 37, Code of Federal Regulations, §1.56(a).

I hereby claim foreign priority benefits under Title 35, United States Code, §119 of any foreign application(s) for
patent or inventor's certificate listed below and have also identified below any foreign application for patent or inventor's 
certificate having a filing date before that of the application on which priority is claimed: 



PRIOR FOREIGN APPLICATION(S)

COUNTRY | APPLICATION NUMBER | DATE OF FILING (DAY, MONTH, YEAR) | PRIORITY CLAIMED UNDER 35 U.S.C. 119


I hereby claim the benefit under Title 35, United States Code, §120 of any United States application(s) listed below
and, insofar as the subject matter of each of the claims of this application is not disclosed in the prior United States
application in the manner provided by the first paragraph of Title 35, United States Code, §112, I acknowledge the duty
to disclose material information as defined in Title 37, Code of Federal Regulations, §1.56(a) which occurred between
the filing date of the prior application and the national or PCT international filing date of this application:



PRIOR UNITED STATES APPLICATION(S)

APPLICATION SERIAL NUMBER | FILING DATE | STATUS (PATENTED, PENDING, ABANDONED)


I hereby declare that all statements made herein of my own knowledge are true and that all statements made on 
information and belief are believed to be true; and further that these statements were made with the knowledge that 
willful false statements and the like so made are punishable by fine or imprisonment, or both, under Section 1001 of Title 
18 of the United States Code and that such willful false statements may jeopardize the validity of the application or any 
patent issued thereon. 

POWER OF ATTORNEY: As a named inventor, I hereby appoint the following attorney(s) and/or agent(s) to
prosecute this application and transact all business in the Patent and Trademark Office connected therewith (list name
and registration number):

Algy Tamoshunas, Reg. No. 27,677
Jack E. Haken, Reg. No. 26,902

SEND CORRESPONDENCE TO: Corporate Patent Counsel; U.S. Philips Corporation; 580 White Plains Road;
Tarrytown, NY 10591

DIRECT TELEPHONE CALLS TO: Gregory L. Thorne, (914) 333-9665

Dated:

Inventor's Signature:

Full Name of Inventor
    Last Name: Krasinski
    First Name: Raymond
    Middle Name:

Residence & Citizenship
    City: Suffern
    State or Foreign Country: New York
    Country of Citizenship: United States of America

Post Office Address
    Street: 5 Reigate Place
    City: Suffern
    State or Country: New York
    Zip Code: 10901




